Improving English-Russian sentence alignment through POS tagging and Damerau-Levenshtein distance
Abstract
The present paper introduces an approach to improving English-Russian sentence alignment, based on POS tagging of source and target texts that were automatically aligned by HunAlign. The initial hypothesis is tested on a corpus of bitexts. The sequence of POS tags for each sentence (specifically nouns, adjectives, verbs and pronouns) is treated as a “word”, and the Damerau-Levenshtein distance between the source and target sequences is computed. This distance is then normalized by the length of the target sentence and compared against a threshold that separates presumably misaligned sentence pairs from “good” ones. The experimental results show a precision of 0.81 and a recall of 0.8, which allows the method to be used as an additional data source in parallel corpus alignment. At the same time, this leaves room for further improvement.
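The scoring step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The tag names, the `threshold` value, the handling of an empty target sequence, and the choice to normalize by the number of kept target tags are all assumptions. The distance function implements the restricted (optimal string alignment) variant of Damerau-Levenshtein distance, which counts adjacent transpositions but never edits the same substring twice.

```python
from typing import Sequence

# Assumed tag inventory: the abstract keeps nouns, adjectives, verbs and
# pronouns; the concrete tag names below are hypothetical.
KEPT_TAGS = {"NOUN", "ADJ", "VERB", "PRON"}


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance
    between two tag sequences, with each tag treated as one symbol."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. NOUN VERB -> VERB NOUN
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def normalized_distance(src_tags: Sequence[str], tgt_tags: Sequence[str]) -> float:
    """Distance between the filtered tag sequences, divided by the length of
    the filtered target sequence (an assumed reading of 'target length')."""
    src = [t for t in src_tags if t in KEPT_TAGS]
    tgt = [t for t in tgt_tags if t in KEPT_TAGS]
    if not tgt:
        # Assumption: with no target tags, fall back to the raw source length.
        return float(len(src))
    return osa_distance(src, tgt) / len(tgt)


def looks_misaligned(src_tags: Sequence[str], tgt_tags: Sequence[str],
                     threshold: float = 0.5) -> bool:
    """Flag a sentence pair whose normalized distance exceeds the threshold.
    The default threshold is a placeholder, not a value from the paper."""
    return normalized_distance(src_tags, tgt_tags) > threshold
```

In practice the tag sequences would come from POS taggers for English and Russian mapped to a shared tag set, and the threshold would be tuned on manually checked sentence pairs.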
Similar resources
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing of natural language that automatically analyzes the dependency structure of sentences and creates a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...
Word Similarity Calculation by Using the Edit Distance Metrics with Consonant Normalization
Edit distance metrics are widely used in many applications, such as string comparison and spelling error correction. Hamming distance is a metric for two strings of equal length, and Damerau-Levenshtein distance is a well-known metric for making spelling corrections through string-to-string comparison. Previous distance metrics seem to be appropriate for alphabetic languages like English and Eur...
A Comparative Study of the Effect of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language
In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging on parsing, as well as the impact of the quantity of information available in the POS tags, is studied. To reach these goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...
Information Retrieval of Jumbled Words
It is known that humans can easily read words in which the letters have been jumbled in a certain way. This paper examines this problem by associating a distance measure with the jumbling process. Modifications to text were generated according to the Damerau-Levenshtein distance, and it was checked whether users were able to read the modified text. Graphical representations of the results are provided.
Aligning Sentences and Words Using English-Hindi Bilingual Parallel Corpora
This dissertation project relates to language engineering issues. The Enabling Minority Language Engineering (EMILLE) project is a collaboration between the University of Sheffield and Lancaster University. It aims to develop a sixty-three-million-word electronic corpus of South Asian languages. As part of the EMILLE project, it was decided to develop a POS tagger for one of the languages...